We’ve already spent time with supervised learning, where we model an outcome variable. Specifically, we dealt with regression and classification. How are they different?
We used supervised learning for inference (i.e., to understand the underlying data generating process), but now we care only about prediction. So instead of searching for the single best model for inference, we’ll fit many models and see which one predicts best.
Let’s import and work with some new data.
# Load packages.
library(tidyverse)
library(tidymodels)
library(dbplyr)
library(DBI)

# Set a simulation seed.
set.seed(42)
The password is practicemakes. It is bad form to save a password in your code, so we prompt for it interactively instead.
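As an alternative to typing the password every session, a common pattern is to store it in an environment variable and read it with Sys.getenv(). This is a sketch, not part of the original code; it assumes you've added a line like DB_PASSWORD=... to your .Renviron file and restarted R.

# Read the password from an environment variable set in .Renviron.
db_password <- Sys.getenv("DB_PASSWORD")

# Fall back to an interactive prompt if the variable isn't set.
if (db_password == "") {
  db_password <- rstudioapi::askForPassword("Database password")
}

You could then pass db_password to the password argument of dbConnect(). The .Renviron file stays out of version control, so the password never lands in your code or your Git history.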
# Connect to the database.
con <- dbConnect(
RPostgreSQL::PostgreSQL(),
dbname = "analyticsdb",
host = "analyticsdb.ccutuqssh92k.us-west-2.rds.amazonaws.com",
port = 55432,
user = "quantmktg",
password = rstudioapi::askForPassword("Database password")
)
# Look at the available data tables.
dbListTables(con)
# Import from the database.
roomba_survey <- tbl(con, "roomba_survey") |>
collect()
# Disconnect.
dbDisconnect(con)
# Write data locally.
roomba_survey |>
select(-row.names) |>
write_csv(here::here("Data", "roomba_survey.csv"))
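Once the data are written locally, later sessions can skip the database connection entirely and read the CSV back in. A minimal sketch, assuming the same Data/roomba_survey.csv path created above:

# Re-import the local copy instead of hitting the database again.
roomba_survey <- read_csv(here::here("Data", "roomba_survey.csv"))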
We can use the survey as a data dictionary.
# Answers to S1?
roomba_survey |>
  count(S1)
## # A tibble: 3 × 2
##      S1     n
##   <dbl> <int>
## 1     1    40
## 2     3    63
## 3     4   229
Previously we were a little lazy and did some feature engineering (i.e., preprocessing) on the outcome variable at the same time as the predictors. We can run into problems that way. Get your outcome variable ready first and save feature engineering for the features (i.e., predictors).
# Wrangle S1 into segment.
roomba_survey <- roomba_survey |>
rename(segment = S1) |>
mutate(
segment = case_when(
segment == 1 ~ "own",
segment == 3 ~ "shopping",
segment == 4 ~ "considering"
),
segment = factor(segment)
)
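It's worth confirming the recode worked as intended. A quick count of the new factor should mirror the S1 tally above (1 → own, 3 → shopping, 4 → considering):

# Check that segment counts line up with the original S1 counts.
roomba_survey |>
  count(segment)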